Recomputation Enabled Efficient Checkpointing
Abstract
Systematic checkpointing of the machine state makes it possible to restart execution from a safe state when an error is detected. The time and energy overhead of checkpointing, however, grows with checkpointing frequency. Amortizing this overhead becomes especially challenging as expected error rates grow, because checkpointing frequency tends to increase with the error rate. Based on the observation that, due to imbalanced technology scaling, recomputing a data value can be more energy efficient than retrieving (i.e., loading) a stored copy, this paper explores how recomputation of data values (which would otherwise be read from a checkpoint in memory or secondary storage) can reduce the machine state to be checkpointed, and thereby the checkpointing overhead. Specifically, the resulting amnesic checkpointing framework, AmnesiCHK, can reduce the storage overhead by up to 23.91%, the time overhead by 11.92%, and the energy overhead by 12.53%, even in a relatively small-scale system.
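The core idea in the abstract, leaving recomputable values out of the checkpoint and regenerating them on restore, can be illustrated with a minimal sketch. This is not the AmnesiCHK implementation, which the abstract does not describe in detail; the state layout and the names `derive`, `full_checkpoint`, `amnesic_checkpoint`, and `restore` are hypothetical, and `pickle` stands in for whatever checkpoint storage a real system would use.

```python
import pickle


def derive(base):
    # Hypothetical derived value: cheap to recompute from `base`.
    return [x * x for x in base]


def full_checkpoint(state):
    # Baseline: store every value, including the derivable one.
    return pickle.dumps(state)


def amnesic_checkpoint(state):
    # Omit the value that can be recomputed from what is stored.
    return pickle.dumps({"base": state["base"]})


def restore(blob):
    # Load the stored values, then recompute anything that was omitted.
    state = pickle.loads(blob)
    if "squares" not in state:
        state["squares"] = derive(state["base"])
    return state


# Illustrative machine state: one base array and one value derived from it.
state = {"base": list(range(1000))}
state["squares"] = derive(state["base"])

full = full_checkpoint(state)
slim = amnesic_checkpoint(state)
recovered = restore(slim)

print(len(full), len(slim), recovered == state)
```

In this toy setting the smaller checkpoint restores to the same state as the full one. Whether the trade pays off in practice depends on the cost of recomputation relative to the cost of loading, which is the balance the paper evaluates.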
Similar resources
Covering resilience: A recent development for binomial checkpointing
Nowadays, adjoint methods form a well established approach to compute gradient information in a very efficient way in terms of runtime. However, as soon as the considered process involves any kind of nonlinearity, the memory requirement to compute the corresponding adjoints is in principle proportional to the operation count of the underlying function, see, e.g., [1, Sec. 4.6]. For this reason,...
An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment
Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...
Asynchronous Two-level Checkpointing Scheme for Large-scale Adjoints in the Spectral-element Solver Nek5000
Adjoints are an important computational tool for large-scale sensitivity evaluation, uncertainty quantification, and derivative-based optimization. An essential component of their performance is the storage/recomputation balance in which efficient checkpointing methods play a key role. We introduce a novel asynchronous two-level adjoint checkpointing scheme for multistep numerical time discreti...
Enabling user-driven Checkpointing strategies in Reverse-mode Automatic Differentiation
This paper presents a new functionality of the Automatic Differentiation (AD) tool tapenade. tapenade generates adjoint codes, which are widely used for optimization or inverse problems. Unfortunately, for large applications the adjoint code demands a great deal of memory, because it needs to store a large set of intermediate values. To cope with that problem, tapenade implements a su...
Avoiding recomputation in linkage analysis.
We describe four improvements we have implemented in a version of the genetic linkage analysis programs in the LINKAGE package: subdivision of recombination classes, better handling of loops, better coordination between the optimization and output routines, and a checkpointing facility. The unifying theme for all the improvements is to store a small amount of data to avoid expensive recomputati...
Journal title:
- CoRR
Volume: abs/1710.04685; Issue: -
Pages: -
Publication date: 2017